Intrinsity FastMATH™
Vector and Matrix Math Processor


Optimized for real-time and adaptive signal processing needs:

■ 2 GHz SIMD 4x4 matrix engine with multiprocessor scalability due to high-bandwidth RapidIO™ interfaces

■ Fixed-point math

■ Programmable in high-level languages (e.g., C)

• Compiler built-in matrix intrinsics

• Vector/matrix library


Innovative architecture:

■ On-chip matrix coprocessor and MIPS32™ ISA RISC core

■ 4x4 array of processors, each with sixteen 32-bit registers and two 40-bit MACs

■ 64 GOPS (peak)

■ Native matrix and vector math instructions: 1-, 8-, 16-, and 32-bit support; convenient complex math

■ Descriptor-based DMA controller

■ 1 Mbyte on-chip cache-coherent L2 cache


Speed plus an architecture designed for parallel computations

Intrinsity FastMATH Vector and Matrix Math Processor


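[Figure: block diagram. Labels: 2 GHz MIPS® scalar engine with dual-issue instructions; 2 GHz interconnected 4x4 matrix processor with 16 registers; DDR-400 SDRAM; two links labeled 1 GB/s; one further bandwidth label that is illegible in the source.]

RapidIO ports balance I/O and processor speed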
Matrix Register Arithmetic:
Element-by-Element


The matrix engine has 16 matrix registers, each holding sixteen 32-bit values. Halfword and word arithmetic are supported.


A single instruction performs element-wise addition of two 4x4 matrices.


[Figure: three 4x4 matrix registers, elements indexed (0,0) through (3,3), illustrating destination = source + source. Each 32-bit element holds either complex data by halfwords (Re, Im) or word data.]
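
The difference between word and halfword arithmetic can be illustrated with a small software model. This is only a sketch under stated assumptions: the helper names are hypothetical, the real part is assumed to sit in the high halfword, and additions are assumed to wrap rather than saturate (the slide does not say which the hardware does).

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def pack_complex(re, im):
    """Pack two signed 16-bit values into one 32-bit element.
    Assumption: real part in the high halfword, imaginary part in the low."""
    return ((re & MASK16) << 16) | (im & MASK16)

def unpack_complex(word):
    """Split a 32-bit element back into signed (re, im) halfwords."""
    def signed16(v):
        return v - 0x10000 if v & 0x8000 else v
    return signed16((word >> 16) & MASK16), signed16(word & MASK16)

def add_words(a, b):
    """Element-wise add of two 4x4 registers, one 32-bit lane per element."""
    return [[(x + y) & MASK32 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add_halfwords(a, b):
    """Element-wise add with two independent 16-bit lanes per element."""
    def add_elem(x, y):
        hi = ((x >> 16) + (y >> 16)) & MASK16
        lo = ((x & MASK16) + (y & MASK16)) & MASK16
        return (hi << 16) | lo
    return [[add_elem(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Two registers filled with the complex values 1 - 1j and 2 - 3j.
m0 = [[pack_complex(1, -1)] * 4 for _ in range(4)]
m1 = [[pack_complex(2, -3)] * 4 for _ in range(4)]

by_halfword = add_halfwords(m0, m1)
by_word = add_words(m0, m1)
```

In halfword mode every element decodes to (3, -4), the correct complex sum. A word add of the same packed data lets the carry out of the imaginary lane spill into the real lane, which is why complex data needs the halfword form.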
Matrix Register Arithmetic:
Matrix Multiplication


Matrix multiply of two 4x4 submatrices by halfword, for example to support 16-bit complex arithmetic


One instruction

■ Four cycles (2 ns @ 2 GHz)

■ 128 operations

Result element (0,0): $\sum_{k=0}^{3} \mathrm{M0h}(0,k) \times \mathrm{M1h}(k,0)$

[Figure: 4x4 result matrix register, elements indexed (0,0) through (3,3).]
matmulhh.m.m M2, M0, M1

High-high halfword multiply, e.g., re x re:

for i = 0 to 3
    for j = 0 to 3
        sum = 0
        for k = 0 to 3
            sum = sum + M0h(i,k) x M1h(k,j);
        M2h(i,j) = sum;


[Figure: 4x4 matrix registers M0 x M1, each with elements indexed (0,0) through (3,3).]

Large matrices can be subdivided into 4x4 parts for multiplication.
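
The subdivision of a large matrix into 4x4 parts can be sketched in software as follows. This is a minimal illustration, not the vendor's library: `matmul4` stands in for one 4x4 multiply instruction, the function names are hypothetical, and the inputs are assumed to be square with a side length that is a multiple of 4.

```python
def matmul4(a, b):
    """One 4x4 by 4x4 product, mirroring the triple loop on the slide."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def add4(a, b):
    """Element-wise sum of two 4x4 blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(m, bi, bj):
    """Extract the 4x4 block at block-row bi, block-column bj."""
    return [row[4 * bj:4 * bj + 4] for row in m[4 * bi:4 * bi + 4]]

def block_matmul(a, b):
    """Multiply two n x n matrices (n a multiple of 4) block by block."""
    n = len(a)
    nb = n // 4
    c = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            acc = [[0] * 4 for _ in range(4)]
            for bk in range(nb):
                # Each step is one 4x4 multiply followed by an element-wise add.
                acc = add4(acc, matmul4(block(a, bi, bk), block(b, bk, bj)))
            for i in range(4):
                c[4 * bi + i][4 * bj:4 * bj + 4] = acc[i]
    return c

# Small deterministic 8x8 inputs for a quick comparison against a plain loop.
a = [[(3 * i + j) % 7 - 3 for j in range(8)] for i in range(8)]
b = [[(i + 2 * j) % 5 - 2 for j in range(8)] for i in range(8)]
c = block_matmul(a, b)
```

A 4x4 product needs 64 multiplies and 64 accumulations, which matches the 128 operations quoted for the instruction. At four cycles per instruction and 2 GHz, that is 32 operations per cycle, consistent with the 64 GOPS peak figure.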
Matrix Register Arithmetic:
Block Rearrangement for Parallelism


[Figure: data for User 1 through User 4 is loaded from cache into matrix registers m0 through m3 and rearranged in three stages: 16 elements of 1 user per register, then 4 elements of 4 users, then 1 element of 16 users.]

Load 4 or 16 data streams (users) and re-block for SIMD parallel processing

■ Original register load instructions

■ block4 (four cycles): matrix operations on four streams

■ For SIMD operations on 16 parallel data streams: continue rearrangement with block data movement instructions, 70 cycles (35 ns) total
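
The four-stream re-blocking step can be pictured with a short software model. The slide does not specify the exact element placement that block4 produces, so the layout below (register r, row u holds samples 4r to 4r+3 of user u) is an assumption, and `reblock4` is a hypothetical name.

```python
def reblock4(regs):
    """Rearrange four 4x4 registers.

    Input:  regs[u] holds 16 consecutive samples of user u, four per row.
    Output: register r, row u holds samples 4r..4r+3 of user u, so each
            register carries 4 elements of each of 4 users.
    """
    return [[list(regs[u][r]) for u in range(4)] for r in range(4)]

# Sample n of user u is encoded as 100*u + n so the origin stays visible.
loaded = [[[100 * u + 4 * row + col for col in range(4)] for row in range(4)]
          for u in range(4)]
blocked = reblock4(loaded)
```

After the rearrangement, one matrix instruction applied to `blocked[r]` touches the same four sample positions of all four users at once, which is the property that lets one SIMD operation serve four streams.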
FastMATH Performance Example:
Fast Fourier Transform


Matrix architecture and clock speed contribute roughly equally to the advantage on this key benchmark

[Figure: bar chart, 1K radix-4 FFT, 16-bit complex data; vertical axis from 0 to 600,000; processors compared: FastMATH @ 2 GHz, TMS320C6416 @ 600 MHz, MSC8101 @ 300 MHz, TMS320C6203.]

Notes: Competitive data from published benchmarks. Competitive clock rates are highest announced.
FastMATH Performance Example:
FFT to Implement OFDM


[Figure: processing chain. Multiple antennas, then front-end processing (e.g., FIR filter), giving N samples of 16-bit complex data; an N-point FFT per antenna (Orthogonal Frequency-Division Multiplexing), giving N frequency coefficients for each antenna; then smart antenna beamforming or symbol-rate processing.]

Example results:

For 8 antennas at 10 Msamples per second with a 1024-pt complex FFT: requires 14.4% of a FastMATH processor
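
The example result implies a per-FFT cycle budget, which the following arithmetic sketch works out. It rests on two assumptions not stated on the slide: that 10 Msamples per second is the rate per antenna, and that 14.4% means the fraction of a 2 GHz processor's cycles.

```python
CLOCK_HZ = 2e9        # FastMATH clock rate from the slides
ANTENNAS = 8
SAMPLE_RATE = 10e6    # samples per second, assumed to be per antenna
FFT_SIZE = 1024
UTILIZATION = 0.144   # 14.4% of one processor

# Each antenna needs one 1024-point FFT per 1024 input samples.
ffts_per_second = ANTENNAS * SAMPLE_RATE / FFT_SIZE

# Cycles available to each FFT at the quoted utilization.
cycles_per_fft = UTILIZATION * CLOCK_HZ / ffts_per_second
time_per_fft_us = cycles_per_fft / CLOCK_HZ * 1e6

print(f"{ffts_per_second:.0f} FFTs/s, "
      f"{cycles_per_fft:.0f} cycles and {time_per_fft_us:.2f} us per FFT")
```

Under these assumptions the load is 78,125 FFTs per second, leaving roughly 3,700 cycles (about 1.8 microseconds) per 1024-point FFT.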
FastMATH Performance Example:

Smart Antennas


Modify an array's beam pattern to amplify desired signals and suppress interference

[Figure: M antennas, N samples from each, pass through front-end processing to give N M-vectors of 16-bit complex samples; these feed the weight calculation, which also takes the reference signal d(t), and the beamforming stage.]

Weight calculation

$R = \sum_{k=1}^{N} x(k)\,x^H(k)$

$w = R^{-1} \sum_{k=1}^{N} d^*(k)\,x(k)$

Beamforming

$y(t) = w^H x(t)$


R = M x M covariance matrix
d = Reference signal
FastMATH Performance Example:

Smart Antennas


Background

• More users than antennas => orthogonal beams not possible

• No a priori information about signal directions => need real-time adaptation

• Input stream is 16-bit complex data

FastMATH Implementation

• Covariance matrix calculated by complex matrix-matrix multiplications on 4x4 submatrices, then re-assembling the full matrix

• Covariance matrix inverted by Cholesky decomposition; use block matrix manipulation instructions to rearrange input into blocks for SIMD parallelization

• Beamforming using matrix-matrix multiplications; more efficient than simple vector math

WCDMA Example Results

• With 64 voice users and 16 antennas, 4 rake fingers per user, weights updated every slot: 0.73 FastMATH processors
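
The weight calculation and beamforming steps can be sketched end to end in floating-point Python. This is an illustration only: the slides describe a fixed-point, 4x4-blocked implementation, whereas the code below uses plain loops and complex floats, solves R w = p through a Cholesky factorization rather than forming an explicit inverse, and all function names are hypothetical.

```python
import math

def covariance(xs):
    """R = sum over k of x(k) x(k)^H for a list of M-element sample vectors."""
    m = len(xs[0])
    r = [[0j] * m for _ in range(m)]
    for x in xs:
        for i in range(m):
            for j in range(m):
                r[i][j] += x[i] * x[j].conjugate()
    return r

def cross_correlation(xs, ds):
    """p = sum over k of conj(d(k)) x(k)."""
    m = len(xs[0])
    p = [0j] * m
    for x, d in zip(xs, ds):
        for i in range(m):
            p[i] += d.conjugate() * x[i]
    return p

def cholesky(r):
    """Lower-triangular L with R = L L^H (R Hermitian positive definite)."""
    m = len(r)
    l = [[0j] * m for _ in range(m)]
    for j in range(m):
        s = r[j][j].real - sum(abs(l[j][k]) ** 2 for k in range(j))
        l[j][j] = complex(math.sqrt(s))
        for i in range(j + 1, m):
            t = r[i][j] - sum(l[i][k] * l[j][k].conjugate() for k in range(j))
            l[i][j] = t / l[j][j]
    return l

def solve_cholesky(l, p):
    """Solve (L L^H) w = p by forward then back substitution."""
    m = len(l)
    z = [0j] * m
    for i in range(m):
        z[i] = (p[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    w = [0j] * m
    for i in reversed(range(m)):
        s = sum(l[k][i].conjugate() * w[k] for k in range(i + 1, m))
        w[i] = (z[i] - s) / l[i][i]
    return w

def beamform(w, x):
    """y = w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))

# Toy check: a reference signal generated by known weights is recovered.
w_true = [1 + 1j, -0.5j, 2 + 0j]
xs = [[1 + 0j, 0j, 0j], [0j, 1 + 0j, 0j], [0j, 0j, 1 + 0j],
      [1 + 0j, 1j, 0j], [0j, 1 + 0j, 1j]]
ds = [beamform(w_true, x) for x in xs]
w = solve_cholesky(cholesky(covariance(xs)), cross_correlation(xs, ds))
```

Because the toy reference signal is an exact linear function of the inputs, the computed `w` matches `w_true` up to rounding. With real data the same steps give the least-squares weights.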
Scaled Multiprocessor Example:
CDMA Multi-User Detection


• Mitigate interference between users in CDMA

• Solve for estimators for correct symbols, beginning with user-user correlation matrix R and user input vector y

• Difference equation for interference on symbol m of desired user from nearby symbols of all other users:


■ b is the desired estimator vector, to be found, for symbol m of N users

Implementation 

• Jacobi iteration: Solve for matrix B of M symbols for N users. Perform matrix-matrix multiplications distributed over processors

• Calculate correlation matrices R on chip; large-capacity L2 cache reduces data transfer

• At each iteration, exchange partial results over RapidIO port via DMA

• RapidIO interfaces work in the background in parallel with computations, so data transfer time is efficiently hidden
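
The Jacobi iteration named above can be sketched generically as follows. The slide's own difference equation is not reproduced in the source, so this is the textbook Jacobi update for R B = Y rather than the presenter's exact formulation; it is real-valued, runs on a single processor, and the function name and iteration count are hypothetical.

```python
def jacobi(r, y, iterations=50):
    """Approximate B in R B = Y by Jacobi iteration.

    r: N x N correlation matrix (assumed diagonally dominant so the
       iteration converges)
    y: N x M matrix of inputs, one column per symbol
    """
    n = len(r)
    m = len(y[0])
    b = [[0.0] * m for _ in range(n)]
    for _ in range(iterations):
        # Every row of the new estimate depends only on the previous
        # estimate, so rows could be split across processors and the
        # partial results exchanged once per iteration.
        b = [[(y[i][s] - sum(r[i][j] * b[j][s] for j in range(n) if j != i))
              / r[i][i]
              for s in range(m)]
             for i in range(n)]
    return b

# Toy problem: three users, two symbols each, known answer b_true.
r = [[1.0, 0.2, 0.1],
     [0.2, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
b_true = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [[sum(r[i][j] * b_true[j][s] for j in range(3)) for s in range(2)]
     for i in range(3)]
b = jacobi(r, y)
```

The inner sum over all other users is a matrix-matrix product with the off-diagonal part of R, which is the operation the slide distributes over processors.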
Scaled Multiprocessor Example:
WCDMA Short Code Multi-User Detection


Mitigate user-user interference in WCDMA via MUD: Jacobi algorithm

Data transfer in parallel with computation

Scalable multiprocessor system distributing tasks and results over RapidIO interface via coherent L2 cache

Data transfer: RapidIO ports and large L2 enable up to 134 users

[Figure: chart for a RapidIO-chained processor array with regions labeled 1 Chip, Add 2nd Chip, and 4 Chips; labels 48, 68, and 134 Users; RapidIO bandwidth and PCI bandwidth levels marked.]